Introduction to Machine Learning

Use Cases:

1.Fraud Detection
2.Credit Scoring & Next Best offers
3.Prediction of Equipment failures
4.New pricing models
5.Customer Segmentation
6.Text Sentiment Analysis
7.Email Spam Filtering
8.Financial Modeling

Types of Machine Learning

1. Supervised
2. Unsupervised 
3. Reinforcement Learning

Machine Learning with Python using Scikit Learn

Syntax:

from sklearn.family import Model

from sklearn.linear_model import LinearRegression

Estimator parameters: All the parameters of an estimator can be when it is instantiated, and have suitable default values

Cross Validation Types

1. Train_test_split

How to address the issue of overfitting & underfitting

Underfitting/High Bias: This form of hypothesis function h maps poorly to the trend of the data.

        Reason : 
        1. Function that is too simple
        2. Uses too few features

Overfitting/High Variance: Tjis forms the hypothesis function that fits the available data but does not generalize well to predict new data

1. Reduce the number of features:

        a. Manually select which features to keep
        b. Use model selection algorithm 

2. Regulariation 

        a. Keep all the features, but reduce the magnitude of parameters ThetaJ
        b. Regularization works well when we have a lot of slightly useful features

Linear Regression

What is Linear Regression?

Regression is a parametric technique used to predict continuous (dependent) variable given a set of independent variables.

$ Y = βo + β1X + ∈ $

1. Y   -  This is the variable we predict
2. X   -  This is the variable we use to make a prediction
3. βo  -  This is the intercept term. It is the prediction value you get when X = 0
4. β1  -  This is the slope term. It explains the change in Y when X changes by 1 unit. ∈ - This represents the residual   value, i.e. the difference between actual and predicted values.

5. Error reduction Techniques
    a. Ordinary Least Square - ∑[Actual(y) - Predicted(y')]²                                                                                                                                                                                                             Why OLS?
        i. It uses squared error which has nice mathematical properties, thereby making it easier to differentiate and compute gradient descent
        ii.OLS is easy to analyze and computationally faster, i.e. it can be quickly applied to data sets having 1000s of features
        iii.Interpretation of OLS is much easier than other regression techniques

    b. Generalized Least Square
    c. Percentage Least Square
    d. Total Least Square
    e. Least absolute deviation


Formula for calculating the coefficients

$β1 = Σ(xi - xmean)(yi-ymean)/ Σ (xi - xmean)²$ where i= 1 to n (no. of obs.)

$βo = ymean - β1(xmean)$

Case Study : Predicting Housing Price


In [196]:
# Case Study : Predicting Housing Price

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [197]:
df = pd.read_csv("C:/Users/melvin/Machine Learning/Linear Regression/USA_Housing.csv")

In [198]:
# Summary
df.head()


Out[198]:
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price Address
0 79545.458574 5.682861 7.009188 4.09 23086.800503 1.059034e+06 208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1 79248.642455 6.002900 6.730821 3.09 40173.072174 1.505891e+06 188 Johnson Views Suite 079\nLake Kathleen, CA...
2 61287.067179 5.865890 8.512727 5.13 36882.159400 1.058988e+06 9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3 63345.240046 7.188236 5.586729 3.26 34310.242831 1.260617e+06 USS Barnett\nFPO AP 44820
4 59982.197226 5.040555 7.839388 4.23 26354.109472 6.309435e+05 USNS Raymond\nFPO AE 09386

In [191]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.5+ KB

In [192]:
# Validating Linear Regression Assumptions
df.describe()


Out[192]:
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03
mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06
std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05
min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04
25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05
50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06
75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06
max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06

In [193]:
df.columns


Out[193]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

In [194]:
sns.pairplot(df)


Out[194]:
<seaborn.axisgrid.PairGrid at 0x195de5f8>

In [195]:
sns.distplot(df['Price'])


C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[195]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c71beb8>

In [89]:
sns.heatmap(df.corr(),annot=True)


Out[89]:
<matplotlib.axes._subplots.AxesSubplot at 0x13b61ef0>

Validate the Linear Regression Model

1. There exists a linear and additive relationship between dependent (DV) and independent variables (IV)
2. Multicollinearity - present of correlation b/w independent variables
3. Heteroskedestacity- Absence of constant variance in the error terms  
4. Autocorrelation - Presences of correlation in error terms
5. The dependent variable and the error terms should possess a normal distribution

Checking whether the assumptions are violated


In [181]:
sns.pairplot(df)


Out[181]:
<seaborn.axisgrid.PairGrid at 0x17906780>

In [17]:
df.columns


Out[17]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

In [100]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms','Avg. Area Number of Bedrooms', 'Area Population']]
y= df['Price']


  File "<ipython-input-100-6a15e861f2ac>", line 3
    sns.residplot(x=X,y=y,data=,color='blue')
                               ^
SyntaxError: invalid syntax

In [113]:
# Splitting the dataset into training & test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)
print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))


404
102
404
102

In [28]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

In [33]:
lm.fit(X_train,y_train)
print('Intercept = ', lm.intercept_ ,'Coefficients = ',lm.coef_)


Intercept =  -2628436.30126 Coefficients =  [  2.15423467e+01   1.64823861e+05   1.19807562e+05   2.31574320e+03
   1.52295835e+01]

In [138]:
import seaborn as sns
anscombe = sns.load_dataset("anscombe")

rs = np.random.RandomState(7)
x= rs.normal(2,1,75)
y= 2 +1.5*x+rs.normal(0,2,75)

sns.residplot(x,y,lowess=True)


Out[138]:
<matplotlib.axes._subplots.AxesSubplot at 0x148ad6a0>

In [91]:
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
boston.feature_names


Out[91]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='<U7')

In [152]:
'''
import statsmodels.api as sm
model = sm.OLS(y_train,X_train).fit()
predictions = model.predict(X_test)

# Print out the statistics
model.summary()
'''


Out[152]:
OLS Regression Results
Dep. Variable: y R-squared: 0.957
Model: OLS Adj. R-squared: 0.956
Method: Least Squares F-statistic: 668.7
Date: Thu, 03 Aug 2017 Prob (F-statistic): 1.02e-257
Time: 14:55:47 Log-Likelihood: -1226.0
No. Observations: 404 AIC: 2478.
Df Residuals: 391 BIC: 2530.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
x1 -0.1020 0.038 -2.697 0.007 -0.176 -0.028
x2 0.0426 0.017 2.483 0.013 0.009 0.076
x3 -0.0367 0.071 -0.517 0.605 -0.176 0.103
x4 3.3418 1.029 3.248 0.001 1.319 5.365
x5 -1.3299 3.850 -0.345 0.730 -8.899 6.239
x6 5.6706 0.353 16.078 0.000 4.977 6.364
x7 0.0025 0.016 0.163 0.871 -0.028 0.033
x8 -0.8610 0.220 -3.919 0.000 -1.293 -0.429
x9 0.1935 0.074 2.629 0.009 0.049 0.338
x10 -0.0090 0.004 -2.121 0.035 -0.017 -0.001
x11 -0.4256 0.127 -3.349 0.001 -0.676 -0.176
x12 0.0175 0.003 5.729 0.000 0.011 0.023
x13 -0.4571 0.059 -7.787 0.000 -0.573 -0.342
Omnibus: 151.909 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 949.294
Skew: 1.464 Prob(JB): 7.30e-207
Kurtosis: 9.916 Cond. No. 8.46e+03

In [169]:
X = boston.data
y = boston.target



# Splitting the dataset into training & test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

lm.fit(X_train,y_train)


print('R-Square =',(lm.score(X_train,y_train) * 100))


R-Square = 72.8581829267

In [ ]:

Predictions


In [182]:
predictions = lm.predict(X_test)

accuracy = (metrics.r2_score(y_test,predictions))
print('R-Square',accuracy*100)


R-Square 77.8720987477

In [176]:
plt.scatter(y_test,predictions)


Out[176]:
<matplotlib.collections.PathCollection at 0x155c5978>

In [118]:
import seaborn as sns
sns.distplot((y_test-predictions))


C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x144e9390>

Evaluation Metrics


In [187]:
from sklearn import metrics
print('MAE = ', metrics.mean_absolute_error(y_test,predictions))
print('MAE = ',metrics.mean_squared_error(y_test,predictions))
print('MAE = ',np.sqrt(metrics.mean_squared_error(y_test,predictions)))
print('R-Square =',metrics.explained_variance_score(y_test,predictions))


MAE =  3.11306137942
MAE =  18.5121317884
MAE =  4.30257269415
R-Square = 0.780463770177

RMSE / MSE / MAE -

Error metric is the crucial evaluation number we must check. Since all these are errors, lower the number, better the model. Let's look at them one by one:

MSE - This is mean squared error. It tends to amplify the impact of outliers on the model's accuracy. For example, suppose the actual y is 10 and predictive y is 30, the resultant MSE would be (30-10)² = 400.

MAE - This is mean absolute error. It is robust against the effect of outliers. Using the previous example, the resultant MAE would be (30-10) = 20

RMSE - This is root mean square error. It is interpreted as how far on an average, the residuals are from zero. It nullifies squared effect of MSE by square root and provides the result in original units as data. Here, the resultant RMSE would be √(30-10)² = 20. Don't get baffled when you see the same value of MAE and RMSE. Usually, we calculate these numbers after summing overall values (actual - predicted) from the data.

Logistic Regression

Logistic Regression belongs to the family of generalized linear models. It is a binary classification algorithm used when the response variable is dichotomous (1 or 0).

Examples:

1. Ham/Spam
2. Loan Defaulters(Yes/No)
3. Disease Diagnosis

Assumptions :

1. The response variable must follow a binominal distribution
2. Logistic Regression assumes a linear relationship between the independent variables and the link function(logit)
3. The dependent variable should have mutually exclusive and exhaustive categories


Note : The linear function results in probabilities outside [0,1] interval, thereby making them invalid predictions

Types of Logistics Regression

1. Multinomial Logistics Regression
2. Ordinal Logistics Regression

Multinomial Logistic Regression

The technique handles the multi-class problem by fitting K-1 independent binary logistic classifier model.

Drawback:

a. It doesn't scale well in the presence of a large number of target classes
b. Requires a larger dataset to achieve reasonable accuracy

Ordinal Logistics Regression

This technique is used when the target variable in ordinal in nature (Eg. years of work experience 5>4>3>2>1). Ordinal Logistics Regression builds a single model with mutiple threshold values.

If we have K classes, the model will require K -1 threshold or cutoff points. Also, it makes an imperative assumption of proportional odds. The assumption says that on a logit (S shape) scale, all of the thresholds lie on a straight line.

Note: Logistic Regression is not a great choice to solve multi-class problems. But, it's good to be aware of its types. In this tutorial we'll focus on Logistic Regression for binary classification task.

Binomial Distribution Characteristics
    1. There must be fixed number of trials denoted by n
    2. Each trial has only two outcomes
    3. The outcome of each trial must be independent of each other
    4. The probability of sucess(p) & failure should be the same for each trail
How Logistic Regression Works?
a. A Unit change in input feature doesn't really affect the model output directly but it affects the odds ratio

b. We use Maximum likehood method to determine the best coefficients and eventually a good model fit.(Tries to find values of βo and β1 such that the resultant probabilities are closest to either 1 or 0)
How can you evaluate Logistic Regression model fit and accuracy ?
   1. Akaike Information Criteria

       a. Counter part of Adjusted R-Square in multiple regression
       b. Smaller the better
       c. Adding more variables to the model wouldn't let AIC increase. It helps overfitting

Note: Looking at one AIC metric of one model would help. So build 2 or 3 Logistic Regression models and compare their AIC

   2. Null Deviance and Residual Deviance

       a. Deviance of an observation is computed as -2 times log likelihood of that observation
       b. The null model predicts class via a constant probability
       c. Residual deviance is calculated from the model having all the features
       d. Results: 
               i.   The larger the difference between null and residual deviance, better the model
               ii.  Lower null deviance, means that the model explains deviance pretty well, and is a better model
               iii. Lower the residual deviance, better the model 

   3. Confusion Matrix

                          1               0  
                      (Predicted)    (Predicted)

          1
       (Actual)             TP           FN (Type 2)



           0
       (Actual)             FP(Type 1)           TN  


       Metrics:

       Accuracy :It determines the overall predicted accuracy of the model.

       Accuracy = (TP + TN) / (TP + TN + FP + FN)

       True Positive Rate/Sensitivity/Recall : It indicates how many positive values, out of all the positive values.have   
       been correctly predicted 

       Sensitivity/Recall = TP / TP + FN

       False Negative Rate : 1- Sensitivity

       True Negative Rate /Specificity: It indicates how many negative values, out of all the negative values, have 
       been correctly predicted   

       FP Rate : 1 - Specificity

       Precision : It indicates how many values, out of all the predicted positive values, are actually positive.

       Precision = TP/ TP+ FP 

       F Score: F score is the harmonic mean of precision and recall. 
                It lies between 0 and 1. Higher the value, better the model.
                It is formulated as 2((precision*recall) / (precision+recall))


       4. Receiver Operator Characteristic (ROC)

       ROC determines the accuracy of a classification model at a user defined threshold value. It determines the model's              accuracy using Area Under Curve (AUC).

       Measure : Higher the area, better the model

In [347]:
# Case Study Titanic Dataset:

# Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [430]:
# Importing the requried training dataset
train= pd.read_csv("C:/Users/melvin/Machine Learning/Logistic Regression/train.csv")

In [431]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')


Out[431]:
<matplotlib.axes._subplots.AxesSubplot at 0x27bb86d8>

Findings :

       1. We are missing a lot of Cabin information
       2. A lot of Age information is also missing 
       3. One Embarked informatio is missing

Solution:

       1. For Age we can go for missing value imputation
       2. We can drop or tranform it into Categorial variables like Known/Unknown

In [432]:
sns.set_style('whitegrid')

In [433]:
sns.countplot(x='Survived',hue='Sex',data=train)


Out[433]:
<matplotlib.axes._subplots.AxesSubplot at 0x27bc2f28>

In [434]:
sns.countplot(x='Survived',hue='Pclass',data=train)


Out[434]:
<matplotlib.axes._subplots.AxesSubplot at 0x27ccc4e0>

In [435]:
sns.distplot(train['Age'].dropna(),bins=30,kde = False)


Out[435]:
<matplotlib.axes._subplots.AxesSubplot at 0x27ddc6d8>

In [436]:
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

In [437]:
sns.countplot(x='SibSp',data=train,hue='Survived')


Out[437]:
<matplotlib.axes._subplots.AxesSubplot at 0x27ee8320>

In [438]:
Finding :
    
    1. We are able to see that the people who had no siblings or 1 siblings have died. Its opposite of what I thought.


  File "<ipython-input-438-941e121aa2d0>", line 1
    Finding :
             ^
SyntaxError: invalid syntax

In [439]:
'''
import cufflinks as cf
cf.go_offline()
train['Fare'].iplot(kind='hist',bins=50)
'''


Out[439]:
"\nimport cufflinks as cf\ncf.go_offline()\ntrain['Fare'].iplot(kind='hist',bins=50)\n"

Cleaning our data

1. Filling the missing values with mean/median

In [440]:
plt.figure(figsize=(10,7))
sns.boxplot(y='Age',x='Pclass',data=train)


Out[440]:
<matplotlib.axes._subplots.AxesSubplot at 0x2816a6a0>

In [441]:
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[0]
    
    if pd.isnull(Age):
        
        if Pclass == 1:
                return 37
        elif Pclass == 2:
                return 29
        else:
            return 24
    else:
        return Age
    
     
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)

In [442]:
plt.figure(figsize=(18,8))
sns.heatmap(train.isnull(),yticklabels=False,cmap='viridis')


Out[442]:
<matplotlib.axes._subplots.AxesSubplot at 0x2827f9e8>

In [443]:
train.drop('Cabin',axis=1,inplace=True)

In [444]:
train.dropna(inplace=True)

In [445]:
# Converting the Categorcial Variable into dummy variable
sex = pd.get_dummies(train['Sex'],drop_first=True)
sex.count()


Out[445]:
male    889
dtype: int64

In [446]:
embark = pd.get_dummies(train['Embarked'],drop_first=True)
embark.count()


Out[446]:
Q    889
S    889
dtype: int64

In [447]:
train = pd.concat([train,sex,embark],axis=1)
train


Out[447]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked male Q S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 1 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 0 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 0 1
5 6 0 3 Moran, Mr. James male 24.0 0 0 330877 8.4583 Q 1 1 0
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 S 1 0 1
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 S 1 0 1
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 S 0 0 1
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 C 0 0 0
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 S 0 0 1
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 S 0 0 1
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 S 1 0 1
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 S 1 0 1
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 S 0 0 1
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 S 0 0 1
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 Q 1 1 0
17 18 1 2 Williams, Mr. Charles Eugene male 24.0 0 0 244373 13.0000 S 1 0 1
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 S 0 0 1
19 20 1 3 Masselmani, Mrs. Fatima female 24.0 0 0 2649 7.2250 C 0 0 0
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 S 1 0 1
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 S 1 0 1
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 Q 0 1 0
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 S 1 0 1
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 S 0 0 1
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 S 0 0 1
26 27 0 3 Emir, Mr. Farred Chehab male 24.0 0 0 2631 7.2250 C 1 0 0
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 S 1 0 1
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female 24.0 0 0 330959 7.8792 Q 0 1 0
29 30 0 3 Todoroff, Mr. Lalio male 24.0 0 0 349216 7.8958 S 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 S 1 0 1
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 S 0 0 1
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female 24.0 8 2 CA. 2343 69.5500 S 0 0 1
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 S 1 0 1
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 S 0 0 1
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 C 0 0 0
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 S 1 0 1
868 869 0 3 van Melkebeke, Mr. Philemon male 24.0 0 0 345777 9.5000 S 1 0 1
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 S 1 0 1
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 S 1 0 1
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 S 0 0 1
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 S 1 0 1
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 S 1 0 1
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 C 0 0 0
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 C 0 0 0
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 S 1 0 1
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 S 1 0 1
878 879 0 3 Laleff, Mr. Kristo male 24.0 0 0 349217 7.8958 S 1 0 1
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C 0 0 0
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 S 0 0 1
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 S 1 0 1
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 S 0 0 1
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 S 1 0 1
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 S 1 0 1
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 Q 0 1 0
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 S 1 0 1
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 S 0 0 1
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 24.0 1 2 W./C. 6607 23.4500 S 0 0 1
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C 1 0 0
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 Q 1 1 0

889 rows × 14 columns


In [448]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [449]:
train.head()


Out[449]:
PassengerId Survived Pclass Age SibSp Parch Fare male Q S
0 1 0 3 22.0 1 0 7.2500 1 0 1
1 2 1 1 38.0 1 0 71.2833 0 0 0
2 3 1 3 26.0 0 0 7.9250 0 0 1
3 4 1 1 35.0 1 0 53.1000 0 0 1
4 5 0 3 35.0 0 0 8.0500 1 0 1

In [450]:
train.drop(['PassengerId'],axis=1,inplace=True)

In [451]:
train


Out[451]:
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1
5 0 3 24.0 0 0 8.4583 1 1 0
6 0 1 54.0 0 0 51.8625 1 0 1
7 0 3 2.0 3 1 21.0750 1 0 1
8 1 3 27.0 0 2 11.1333 0 0 1
9 1 2 14.0 1 0 30.0708 0 0 0
10 1 3 4.0 1 1 16.7000 0 0 1
11 1 1 58.0 0 0 26.5500 0 0 1
12 0 3 20.0 0 0 8.0500 1 0 1
13 0 3 39.0 1 5 31.2750 1 0 1
14 0 3 14.0 0 0 7.8542 0 0 1
15 1 2 55.0 0 0 16.0000 0 0 1
16 0 3 2.0 4 1 29.1250 1 1 0
17 1 2 24.0 0 0 13.0000 1 0 1
18 0 3 31.0 1 0 18.0000 0 0 1
19 1 3 24.0 0 0 7.2250 0 0 0
20 0 2 35.0 0 0 26.0000 1 0 1
21 1 2 34.0 0 0 13.0000 1 0 1
22 1 3 15.0 0 0 8.0292 0 1 0
23 1 1 28.0 0 0 35.5000 1 0 1
24 0 3 8.0 3 1 21.0750 0 0 1
25 1 3 38.0 1 5 31.3875 0 0 1
26 0 3 24.0 0 0 7.2250 1 0 0
27 0 1 19.0 3 2 263.0000 1 0 1
28 1 3 24.0 0 0 7.8792 0 1 0
29 0 3 24.0 0 0 7.8958 1 0 1
... ... ... ... ... ... ... ... ... ...
861 0 2 21.0 1 0 11.5000 1 0 1
862 1 1 48.0 0 0 25.9292 0 0 1
863 0 3 24.0 8 2 69.5500 0 0 1
864 0 2 24.0 0 0 13.0000 1 0 1
865 1 2 42.0 0 0 13.0000 0 0 1
866 1 2 27.0 1 0 13.8583 0 0 0
867 0 1 31.0 0 0 50.4958 1 0 1
868 0 3 24.0 0 0 9.5000 1 0 1
869 1 3 4.0 1 1 11.1333 1 0 1
870 0 3 26.0 0 0 7.8958 1 0 1
871 1 1 47.0 1 1 52.5542 0 0 1
872 0 1 33.0 0 0 5.0000 1 0 1
873 0 3 47.0 0 0 9.0000 1 0 1
874 1 2 28.0 1 0 24.0000 0 0 0
875 1 3 15.0 0 0 7.2250 0 0 0
876 0 3 20.0 0 0 9.8458 1 0 1
877 0 3 19.0 0 0 7.8958 1 0 1
878 0 3 24.0 0 0 7.8958 1 0 1
879 1 1 56.0 0 1 83.1583 0 0 0
880 1 2 25.0 0 1 26.0000 0 0 1
881 0 3 33.0 0 0 7.8958 1 0 1
882 0 3 22.0 0 0 10.5167 0 0 1
883 0 2 28.0 0 0 10.5000 1 0 1
884 0 3 25.0 0 0 7.0500 1 0 1
885 0 3 39.0 0 5 29.1250 0 1 0
886 0 2 27.0 0 0 13.0000 1 0 1
887 1 1 19.0 0 0 30.0000 0 0 1
888 0 3 24.0 1 2 23.4500 0 0 1
889 1 1 26.0 0 0 30.0000 1 0 0
890 0 3 32.0 0 0 7.7500 1 1 0

889 rows × 9 columns


In [452]:
train.count()


Out[452]:
Survived    889
Pclass      889
Age         889
SibSp       889
Parch       889
Fare        889
male        889
Q           889
S           889
dtype: int64

In [453]:
X = train.drop('Survived',axis=1)
y = train['Survived']

In [456]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.20,random_state = 1)

In [464]:
from sklearn.linear_model import LogisticRegression
Logistic_M = LogisticRegression(n_jobs=5)

In [465]:
Logistic_M.fit(X_train,y_train)


Out[465]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=5,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [466]:
predictions = Logistic_M.predict(X_test)

In [467]:
from sklearn.metrics import classification_report

In [468]:
print(classification_report(y_test,predictions))


             precision    recall  f1-score   support

          0       0.85      0.87      0.86       105
          1       0.80      0.78      0.79        73

avg / total       0.83      0.83      0.83       178


In [469]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)


Out[469]:
array([[91, 14],
       [16, 57]])

K Nearest Neighhors

KNN is a classification algorithm.

How it works?

1. 


Pro's:

1. Very Simple
2. Training is trivial
3. Works with any number of classes
4. Easy to add more data
5. Few parameter
    a. K
    b. Distance Metric

Con's :

1. High Prediction Cost
2. Not good with high dimensional data
3. Categorial Features don't work well

KNN Use Case :


In [3]:
# importing the requried Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [5]:
# Importing the Requried Dataset
data = pd.read_csv("C:/Users/melvin/Machine Learning/KNN/KNN_Project_Data.csv")
data.head()


Out[5]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC TARGET CLASS
0 1636.670614 817.988525 2565.995189 358.347163 550.417491 1618.870897 2147.641254 330.727893 1494.878631 845.136088 0
1 1013.402760 577.587332 2644.141273 280.428203 1161.873391 2084.107872 853.404981 447.157619 1193.032521 861.081809 1
2 1300.035501 820.518697 2025.854469 525.562292 922.206261 2552.355407 818.676686 845.491492 1968.367513 1647.186291 1
3 1059.347542 1066.866418 612.000041 480.827789 419.467495 685.666983 852.867810 341.664784 1154.391368 1450.935357 0
4 1018.340526 1313.679056 950.622661 724.742174 843.065903 1370.554164 905.469453 658.118202 539.459350 1899.850792 0

In [6]:
sns.pairplot(data=data)


Out[6]:
<seaborn.axisgrid.PairGrid at 0x4d182e8>

In [7]:
# Standardize the Variables

In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=data.columns[:-1])
df_feat.head()


Out[8]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC
0 1.568522 -0.443435 1.619808 -0.958255 -1.128481 0.138336 0.980493 -0.932794 1.008313 -1.069627
1 -0.112376 -1.056574 1.741918 -1.504220 0.640009 1.081552 -1.182663 -0.461864 0.258321 -1.041546
2 0.660647 -0.436981 0.775793 0.213394 -0.053171 2.030872 -1.240707 1.149298 2.184784 0.342811
3 0.011533 0.191324 -1.433473 -0.100053 -1.507223 -1.753632 -1.183561 -0.888557 0.162310 -0.002793
4 -0.099059 0.820815 -0.904346 1.609015 -0.282065 -0.365099 -1.095644 0.391419 -1.365603 0.787762

In [10]:
# Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features,data['TARGET CLASS'],test_size=0.30)

from sklearn.neighbors import KNeighborsClassifier
Knn = KNeighborsClassifier()
Knn.fit(X_train,y_train)
prediction = Knn.predict(X_test)

In [11]:
# Metrics

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))


[[113  32]
 [ 33 122]]
             precision    recall  f1-score   support

          0       0.77      0.78      0.78       145
          1       0.79      0.79      0.79       155

avg / total       0.78      0.78      0.78       300


In [12]:
# Create a KNN model instance with n_neighbors=n
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [13]:
# Fit this KNN model to the training data

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')


Out[13]:
<matplotlib.text.Text at 0x132445f8>

In [18]:
from sklearn.neighbors import KNeighborsClassifier
Knn = KNeighborsClassifier(n_neighbors=20)
Knn.fit(X_train,y_train)
prediction = Knn.predict(X_test)

In [19]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))


[[120  25]
 [ 29 126]]
             precision    recall  f1-score   support

          0       0.81      0.83      0.82       145
          1       0.83      0.81      0.82       155

avg / total       0.82      0.82      0.82       300

Done!


In [ ]:
# Random Forest